Paper 1224-2017: A Moment-Matching Approach for Generating Synthetic Data in SAS®
نویسنده
چکیده
Disseminating data to potential collaborators can be essential in the development of models, algorithms, and innovative research opportunities. However, it is often time consuming to get approval to access sensitive data such as health data. An alternative to sharing the real data is to use synthetic data, which has similar properties to the original data but does not disclose sensitive information. The collaborators can use the synthetic data to make preliminary models or work out bugs in their code while waiting to get approval to access the original data. A data owner can also use the synthetic data to crowdsource solutions from the public through competitions like Kaggle and then test those solutions on the original data. This paper implements a method that generates fully synthetic data in a way that matches the statistical moments of the true data up to a specified moment order as a SAS macro. Variables in the synthetic data set are of the same data type as the true data (for example, integer, binary, continuous). The implementation uses the linear programming solver within a column generation algorithm and the mixed integer linear programming solver from the OPTMODEL procedure in SAS/OR. The COFOR statement in PROC OPTMODEL automatically parallelizes a portion of the algorithm. This paper demonstrates the method by using the Sashelp.Heart data set to generate fully synthetic data copies.
منابع مشابه
Sampling-Based Speech Parameter Generation Using Moment-Matching Networks
This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech even if we try to express the same linguistic and para-linguistic information, typical statistical speech synthesis produces completely the same speech, i.e., there is no inter-utterance variatio...
متن کاملA Fully Integrated Approach for Better Determination of Fracture Parameters Using Streamline Simulation; A gas condensate reservoir case study in Iran
Many large oil and gas fields in the most productive world regions happen to be fractured. The exploration and development of these reservoirs is a true challenge for many operators. These difficulties are due to uncertainties in geological fracture properties such as aperture, length, connectivity and intensity distribution. To successfully address these challenges, it is paramount to im...
متن کاملAdversarial Feature Matching for Text Generation
The Generative Adversarial Network (GAN) has achieved great success in generating realistic (realvalued) synthetic data. However, convergence issues and difficulties dealing with discrete data hinder the applicability of GAN to text. We propose a framework for generating realistic text via adversarial training. We employ a long shortterm memory network as generator, and a convolutional network ...
متن کاملEvaluation of Similarity Measures for Template Matching
Image matching is a critical process in various photogrammetry, computer vision and remote sensing applications such as image registration, 3D model reconstruction, change detection, image fusion, pattern recognition, autonomous navigation, and digital elevation model (DEM) generation and orientation. The primary goal of the image matching process is to establish the correspondence between two ...
متن کاملAn Improved Semantic Schema Matching Approach
Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Process (OLAP), Data mining, semantic web [2] and schema integration. This task is defined for finding the semantic correspondences between elements of two schemas. Recently, schema matching has found considerable interest in both research and practice. In this paper, we present a new impr...
متن کامل